Statistics for Data Science II
Recall the general linear model, y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon
Until now, we have discussed continuous predictors.
Now we will introduce the use of categorical, or qualitative, predictors.
This means that we will include predictors that categorize the observations.
If a categorical predictor has c classes, we include c-1 indicator variables in the model.
e.g., undergraduate status: freshman, sophomore, junior, senior
e.g., AKC-recognized pug color: fawn, black
We will create indicator variables to include in our model.
The c-1 predictors included in the model will be binary indicators for category. x_i = \begin{cases} 1 & \textnormal{if category $i$} \\ 0 & \textnormal{if another category} \end{cases}
In the undergraduate status example, we could create four indicator variables, but we would only include three in our model.
While we will call them indicator variables, you will more often see them called dummy variables.
We could use mutate() and if_else() to define indicator variables.
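As a minimal sketch of this approach (the students data frame and column names here are hypothetical), we can build one indicator per non-reference category by hand:

```r
library(dplyr)

# hypothetical data frame of undergraduate statuses
students <- tibble(status = c("freshman", "sophomore", "junior", "senior"))

# three indicators, with "freshman" as the reference group
students <- students %>%
  mutate(x_soph   = if_else(status == "sophomore", 1, 0),
         x_junior = if_else(status == "junior", 1, 0),
         x_senior = if_else(status == "senior", 1, 0))
```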
However, if we do not need to condense categories, the dummy_cols() function from the fastDummies package works well.
Note that this will also create a column that indicates if the value is missing or not.
library(fastDummies)
colony <- colony %>% # overwrite data
mutate(quarter = case_when(months == "January-March" ~ "Q1", # convert to quarters
months == "April-June" ~ "Q2",
months == "July-September" ~ "Q3",
months == "October-December" ~ "Q4")) %>%
dummy_cols(select_columns = c("quarter")) # create indicators for quarter
We represent a categorical variable with c classes with c-1 indicator variables in a model.
The category whose indicator variable is not included in the model is called the reference group.
How do we choose a reference group?
It depends on the story being told / what is of interest.
It does not affect the usefulness of the model, only the interpretations.
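If we let R build the indicators for us (as in the next example), one way to choose the reference group is to relevel the factor; this is a sketch with standalone data in place of colony:

```r
# sketch: choose the reference group explicitly by releveling the factor
quarter <- c("Q1", "Q2", "Q3", "Q4")
quarter <- relevel(factor(quarter), ref = "Q4")
levels(quarter)  # "Q4" now comes first, so R treats it as the reference
```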
Suppose we are interested in modeling the number of colonies lost as a function of the year, quarter, and total number of colonies.
What happens when we do not use indicator variables?
Call:
glm(formula = colony_lost ~ quarter + year + colony_max, data = colony)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.279e+05 2.204e+05 1.942 0.0524 .
quarterQ2 -4.058e+03 5.762e+02 -7.042 3.26e-12 ***
quarterQ3 3.302e+01 5.784e+02 0.057 0.9545
quarterQ4 -5.622e+02 5.785e+02 -0.972 0.3314
year -2.119e+02 1.092e+02 -1.941 0.0525 .
colony_max 1.167e-01 1.086e-03 107.466 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 49277184)
Null deviance: 6.2772e+11 on 1149 degrees of freedom
Residual deviance: 5.6373e+10 on 1144 degrees of freedom
(72 observations deleted due to missingness)
AIC: 23641
Number of Fisher Scoring iterations: 2
(Intercept) quarterQ2 quarterQ3 quarterQ4 year
4.279260e+05 -4.057731e+03 3.301536e+01 -5.621519e+02 -2.119223e+02
colony_max
1.166717e-01
R uses the "first group" (alphabetically for character variables; the first level for factors) as the reference group.
It is fine to do this for “ease” - if I am doing preliminary modeling, I do not always use indicator variables.
Warning: If you have a column stored as numeric, this method does not work.
R will treat it as a continuous variable, which is not always correct.
User beware! Always be aware of how categorical variables are stored!
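A sketch of the difference, using simulated data in place of colony: wrapping a numerically coded category in factor() makes R fit c-1 indicators instead of a single slope.

```r
# sketch: quarter stored as the numbers 1-4
df <- data.frame(quarter = rep(1:4, times = 25), y = rnorm(100))

wrong <- lm(y ~ quarter, data = df)          # one slope: treats 1-4 as continuous
right <- lm(y ~ factor(quarter), data = df)  # three indicators, with 1 as reference

length(coef(wrong))  # 2 coefficients (intercept + slope)
length(coef(right))  # 4 coefficients (intercept + 3 indicators)
```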
Circling back to indicator variables, let’s construct our model using those.
Call:
glm(formula = colony_lost ~ quarter_Q2 + quarter_Q3 + quarter_Q4 +
year + colony_max, data = colony)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.279e+05 2.204e+05 1.942 0.0524 .
quarter_Q2 -4.058e+03 5.762e+02 -7.042 3.26e-12 ***
quarter_Q3 3.302e+01 5.784e+02 0.057 0.9545
quarter_Q4 -5.622e+02 5.785e+02 -0.972 0.3314
year -2.119e+02 1.092e+02 -1.941 0.0525 .
colony_max 1.167e-01 1.086e-03 107.466 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 49277184)
Null deviance: 6.2772e+11 on 1149 degrees of freedom
Residual deviance: 5.6373e+10 on 1144 degrees of freedom
(72 observations deleted due to missingness)
AIC: 23641
Number of Fisher Scoring iterations: 2
\hat{y} = 427926.00 - 4057.73 x_{\text{Q2}} + 33.02 x_{\text{Q3}} - 562.15 x_{\text{Q4}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}
Call:
glm(formula = colony_lost ~ quarter_Q1 + quarter_Q2 + quarter_Q3 +
year + colony_max, data = colony)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.274e+05 2.203e+05 1.940 0.0526 .
quarter_Q1 5.622e+02 5.785e+02 0.972 0.3314
quarter_Q2 -3.496e+03 5.987e+02 -5.839 6.84e-09 ***
quarter_Q3 5.952e+02 5.977e+02 0.996 0.3196
year -2.119e+02 1.092e+02 -1.941 0.0525 .
colony_max 1.167e-01 1.086e-03 107.466 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 49277184)
Null deviance: 6.2772e+11 on 1149 degrees of freedom
Residual deviance: 5.6373e+10 on 1144 degrees of freedom
(72 observations deleted due to missingness)
AIC: 23641
Number of Fisher Scoring iterations: 2
\hat{y} = 427363.8 + 562.16 x_{\text{Q1}} - 3495.58 x_{\text{Q2}} + 595.17 x_{\text{Q3}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}
We are interested in knowing if the overall categorical variable is a significant predictor of the outcome.
If c = 2, we can use the results from summary().
If c \ge 3, we must request the Type III ANOVA.
Note: there are three types of sums of squares going into ANOVA. We are usually only interested in two: Type I (sequential) and Type III (marginal).
We will use Anova() from the car package as follows: car::Anova(model_results, type = 3).
Note that we call the function from the car package directly (car::) as it has conflicts with the tidyverse package.
In theory:
Hypotheses
H_0: \beta_{1}^* = ... = \beta_{s}^* = 0 in the model y = \beta_0 + \beta_1 x_1 + ... + \beta_q x_q + \beta_{1}^* x_1^* + ... + \beta_{s}^* x_s^* + \varepsilon
H_1: at least one \beta_i^* \ne 0 in the model y = \beta_0 + \beta_1 x_1 + ... + \beta_q x_q + \beta_{1}^* x_1^* + ... + \beta_{s}^* x_s^* + \varepsilon
Test Statistic and p-Value
F_0 = \frac{\left[ \text{SSReg(full)} - \text{SSReg(reduced)} \right]/s}{\text{MSE(full)}}
p = P(F_{s, n-q-s} \ge F_0)
Rejection Region
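The theoretical test above can be sketched in R by fitting the reduced and full models and comparing them; the simulated data here stand in for colony (assumed names, not the course data):

```r
# sketch of the partial F test: reduced model without the categorical
# predictor vs. full model with its c - 1 = 3 indicators
set.seed(42)
df <- data.frame(quarter = factor(rep(c("Q1", "Q2", "Q3", "Q4"), 25)),
                 year    = rep(2015:2019, 20))
df$y <- 100 - 2 * (df$year - 2015) + 5 * (df$quarter == "Q2") + rnorm(100)

reduced <- lm(y ~ year, data = df)
full    <- lm(y ~ quarter + year, data = df)
anova(reduced, full)  # F statistic on s = 3 numerator degrees of freedom
```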
In practice:
Hypotheses
H_0: \beta_{1} = \beta_{2} = ... = \beta_{c-1} = 0 in the specified model
H_1: at least one \beta_i \ne 0 in the specified model
Test Statistic and p-Value
car::Anova(model_results, type = 3)^*
Rejection Region
^* - note that the Anova() function produces the likelihood ratio test, which gives a \chi^2 statistic.
Both quarter (p<0.001) and total number of colonies (p<0.001) are significant predictors of the number of lost colonies.
The year (p=0.052) is not a significant predictor of the number of lost colonies.
Hypotheses
H_0: \beta_{\text{Q2}} = \beta_{\text{Q3}} = \beta_{\text{Q4}} = 0 in the specified model
H_1: at least one \beta_i \ne 0 in the specified model
Test Statistic and p-Value
\chi^2_0 = 65.6
p < 0.001
Rejection Region
Conclusion/Interpretation
Reject H_0.
There is sufficient evidence to suggest that there is a relationship between number of bee colonies lost and the quarter of the year.
Interpretations of categorical predictors are slightly different than continuous predictors.
For continuous variables, we interpret in terms of a 1-unit increase in x_i.
What is a “1-unit” increase of an indicator variable?
\beta_i for a categorical variable is the average difference between groups x_i=0 and x_i=1.
Suppose \hat{\beta}=-10 when modeling systolic blood pressure as a function of high school graduation (yes=1, no=0).
Those that are high school graduates (x=1, group of interest) have a systolic blood pressure 10 mmHg lower than those that did not graduate high school (x=0, reference group).
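This interpretation can be verified numerically: with a single binary indicator, the fitted slope equals exactly the difference in group means. A sketch with simulated (hypothetical) blood pressure data:

```r
# sketch: the slope on a binary indicator is the difference in group means
set.seed(1)
grad <- rep(c(0, 1), each = 50)               # hypothetical HS graduation
sbp  <- 120 - 10 * grad + rnorm(100, sd = 5)  # hypothetical systolic BP

fit <- lm(sbp ~ grad)
coef(fit)["grad"]                             # approximately -10
mean(sbp[grad == 1]) - mean(sbp[grad == 0])   # identical value
```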
\hat{y} = 427926.00 - 4057.73 x_{\text{Q2}} + 33.02 x_{\text{Q3}} - 562.15 x_{\text{Q4}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}
There are 4058 fewer colonies lost in Q2 as compared to Q1.
There are 33 more colonies lost in Q3 as compared to Q1.
There are 562 fewer colonies lost in Q4 as compared to Q1.
For every year increase in time, 212 fewer colonies are lost.
For every 1 colony increase in the total number of colonies, 1/10 of a colony is expected to be lost.
\hat{y} = 427363.8 + 562.16 x_{\text{Q1}} - 3495.58 x_{\text{Q2}} + 595.17 x_{\text{Q3}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}
There are 562 more colonies lost in Q1 as compared to Q4.
There are 3496 fewer colonies lost in Q2 as compared to Q4.
There are 595 more colonies lost in Q3 as compared to Q4.
For every year increase in time, 212 fewer colonies are lost.
For every 1 colony increase in the total number of colonies, 1/10 of a colony is expected to be lost.
Now that we understand how to interpret, let’s revisit testing.
We’ve already discussed testing the overall significance of the term - this is called global significance.
Now, let’s focus on the individual terms.
We can obtain the individual tests from summary(). The individual tests tell us if there is a significant difference between the group of interest and the reference group.
We are not always interested in the individual tests…
Note: This is essentially post-hoc testing. We should adjust our \alpha!
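One simple adjustment is Bonferroni, applied here to the three quarter indicator p-values from the summary() output above:

```r
# sketch: Bonferroni-adjust the three individual indicator p-values
p_vals <- c(Q2 = 3.26e-12, Q3 = 0.9545, Q4 = 0.3314)
p.adjust(p_vals, method = "bonferroni")  # each p multiplied by 3, capped at 1
```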
Visualization of the model is more of an art than a science… but I still have guidelines.
The generic visualization for a model is to have a scatterplot of the data with a regression line (or multiple lines) overlaid.
Note that the outcome (y) will always be specified on the y-axis.
We will decide on a continuous predictor (one x_i) to vary on the x-axis.
Then we will create predicted values by holding all other variables in the model constant.
\hat{y} = 427926.00 - 4057.73 x_{\text{Q2}} + 33.02 x_{\text{Q3}} - 562.15 x_{\text{Q4}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}
We will let year vary - this will show us what happens over time.
We will create a line for each quarter.
We will plug in the median number of total colonies.
\hat{y} = 427926.00 - 4057.73 x_{\text{Q2}} + 33.02 x_{\text{Q3}} - 562.15 x_{\text{Q4}} - 211.92 x_{\text{year}} + 0.12 x_{\text{tot. col.}}
We will use mutate() to create the predicted values:
# base: 427926.00 - 4057.73*x_Q2 + 33.02*x_Q3 - 562.15*x_Q4 - 211.92*year + 0.12*21000
colony <- colony %>%
  mutate(p_q1 = 427926.00 - 4057.73*0 + 33.02*0 - 562.15*0 - 211.92*year + 0.12*21000,
         p_q2 = 427926.00 - 4057.73*1 + 33.02*0 - 562.15*0 - 211.92*year + 0.12*21000,
         p_q3 = 427926.00 - 4057.73*0 + 33.02*1 - 562.15*0 - 211.92*year + 0.12*21000,
         p_q4 = 427926.00 - 4057.73*0 + 33.02*0 - 562.15*1 - 211.92*year + 0.12*21000)
colony %>%
filter(colony_lost < 5000) %>%
ggplot(aes(x = year, y = colony_lost)) +
geom_point(color = "gray") +
geom_line(aes(x = year, y = p_q1), color = "black", linetype = "dashed", size = 1) +
annotate("text", x = 2021.2, y = 2175, label = "Q1", size=5) +
geom_line(aes(x = year, y = p_q4), color = "black", linetype = "dotted", size = 1) +
annotate("text", x = 2021.2, y = 1600, label = "Q4", size=5) +
scale_x_continuous(breaks = 2015:2021,
labels = paste0(c("2015", "2016", "2017", "2018", "2019", "2020", "2021"), "")) +
labs(y = "Number of Colonies Lost",
x = "Year") +
theme_bw()
In this lecture, we have introduced the concept of categorical predictors.
Always remember that if there are c categories, there will be c-1 predictor terms in the model.
When c = 2, we can use the test results from summary() to discuss significance.
When c \ge 3, we must use a partial F test (i.e., car::Anova())
When including a categorical predictor in a visualization, we must plug in 0's and 1's for the indicators - no means, medians, etc.!